ABSTRACT

This study investigates how closely different
interviewers agree when ranking the same applicants, and determines
the "correctness" of their rating of applicants. Interview data for
applicants to a medical school for two recent consecutive years were
examined. For the first year, 573 applicants were interviewed; for
the second year, 675 were interviewed. Twenty-six physicians
interviewed in the first year, and 38 interviewed in the second year,
with 20 interviewing in both years. In the first year, 146 pairs of
interviewers interviewed the same candidates. In year 2, 73 of the
238 pairs were retained. Results indicated that most interviewers
were both reliable and "correct" or valid. Ratings by interviewers
who were identified as candidates for unreliability were inspected to
see whether they were incongruent with those of the admissions
committee. The method permits ready identification of those
interviewers who are reliable, and whose judgments are in accord with
validity criteria. (HJH)
On the Reliability and Validity of Interviewers of
Medical School Applicants 
John C. Reid 
School of Medicine 
University of Missouri-Columbia

Introduction 

The selection of medical students from a large pool of applicants
is an extraordinarily difficult process because of the large number of
highly qualified applicants. Undergraduate grade point average (GPA)
is the best single predictor of success in medical school (1), and GPA
along with scores on the Medical College Admissions Test (MCAT) are
seriously considered by admissions committees. Even when applicants with
lower GPAs and MCAT scores are rejected, the remaining number of
seemingly qualified applicants is larger than the number of available
positions in the medical school class. To help resolve this dilemma,
most admissions committees personally interview each member of this
smaller, selected pool. At the medical school in the present study, at
least two physicians, members of the admissions committee, interview
each applicant. It would be hoped, first of all, that the two
interviewers closely agree on their rating of each applicant, and
second, that their rating is "correct."

If the two interviewers agree in their rating of the applicants,
then the two interviewers may be considered to be reliable. If the
interviewers' judgments are not only reliable but also "correct" or
"true," then their judgment is also valid. Considering the expense of
both candidates' and interviewers' time and the career decisions that
are being made, the need for determining the reliability and validity
of interviewers' judgments is obvious. Furthermore, the need for such
a study is strengthened by the fact that, despite the admitted
importance of the concepts of reliability and validity, few reports of
these coefficients for admissions committees' interviews have appeared
in the literature (2).

The purpose of the present study was to investigate how closely
different interviewers agree when ranking the same applicants, and also
to determine the "correctness" of their rating of applicants.

In particular, this study determined (1) the degree to which pairs
of interviewers assign the same applicant the same rank, (2) whether
each particular interviewer ranked his interviewees similarly to the
rank assigned by the admissions committee working in toto (the first
validity criterion), and (3) whether each particular interviewer's
ranks correlated with later student success in medical school (the
second validity criterion).

Procedure 

Interview data for applicants to a medical school for two recent
consecutive years (called year one and year two) were examined. For
year one, 573 applicants were interviewed; for year two, 675 were
interviewed. Twenty-six physicians interviewed in year one, and 38
interviewed in year two, and 20 of these interviewed both years.
Physicians were non-randomly assigned to interview applicants.
Interviewers could ask the applicant any questions they felt were
useful, but interviewers specifically needed to gather data to answer
questions on a structured interview form. For year one, the structured
form contained 17 questions having three to five responses. For year
two, the structured form had 10 five-response questions. Typical
questions on the form required the interviewer to rank the applicant on
motivation, or work load outside of studies. After the interview, and
upon considering the applicant's GPA, MCAT, and letters of
recommendation, the interviewer assigned the applicant an overall
evaluation rank of 5, 4, 3, 2, or 1 for year one, or 5, 4, 3, or 1 for
year two, with 5 being the highest. Interviewers were not aware of
other interviewers' ratings.

In year one, 146 pairs of interviewers interviewed the same
candidates. Sixty-three of these pairs interviewed more than three
applicants and were retained in the reliability analysis. In year two,
73 of 238 pairs were retained. The same interviewer typically
interviewed applicants in common with five or six other interviewers.
All interviewers rating more than three applicants were retained in the
validity analysis; 19 out of 26 satisfied this criterion in year one and
30 out of 38 satisfied it in year two. Although a statistic based
on a small number of observations may not justify a hard decision, it
may call attention to a need for more information. The median number
of applicants ranked by these interviewers was 37.5 and 38 for the two
years.

It is worthwhile to review the method of determination of
reliability and validity coefficients before proceeding to the results
of the study. Had all interviewers interviewed all applicants, then
such multivariate methods of reliability as proposed by Cronbach et al.
(3) would be applicable. However, since interviewers only interviewed a
few applicants each, and that non-randomly, the applicant by interviewer
data matrix is somewhat similar to those matrices discussed by Shoemaker
(4), except that procedures for estimation of such a variance-covariance
matrix have not been worked out beyond some important preliminary work
by Timm (5) and Chan (6).

To determine the agreement between each pair of interviewers
ranking the same applicants, four distance formulas were computed: a
Euclidean distance D² (7), its square root D, a distance which was the
sum of absolute values of deviations between interviewers, and a
distance which was the sum of a (0,1) loss function. The binary loss
function was defined as 0 unless the rank difference between
interviewers exceeded unity. For each pair of interviewers, all four
distance functions were divided by the number of applicants rated by
the pair.
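
The four per-pair distances described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's program: the function name `pair_distances`, the dictionary keys, and the sample ratings are hypothetical, and each distance is divided by the number of applicants exactly as the text words it.

```python
from math import sqrt


def pair_distances(ratings_a, ratings_b):
    """Return four distances for one interviewer pair, each divided by n.

    ratings_a and ratings_b hold the overall ranks (1 to 5) that two
    interviewers gave to the same applicants, in the same order.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both interviewers must rate the same applicants")
    n = len(ratings_a)
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    sum_squares = sum(d * d for d in diffs)
    return {
        # Squared Euclidean distance (the "least squares" distance).
        "d_squared": sum_squares / n,
        # Its square root, divided by n as the text literally states.
        "d": sqrt(sum_squares) / n,
        # Sum of absolute deviations between the two interviewers.
        "absolute": sum(abs(d) for d in diffs) / n,
        # (0,1) loss: counts 1 only when ranks differ by more than one.
        "loss": sum(1 for d in diffs if abs(d) > 1) / n,
    }


# Hypothetical ratings for three applicants.
example = pair_distances([5, 3, 1], [3, 3, 3])
```

A root-mean-square form, `sqrt(sum_squares / n)`, is another plausible reading of "D divided by the number of applicants"; the source does not say which was used.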

It is important to realize that statistics such as correlation and
anova destroy some of the information that the distance functions
retain and therefore would be less valuable as reliability coefficients
in this study than measures of distance to measure profile similarity
of interviewers. Two examples may suffice to illustrate this point.
Two interviewers may evaluate applicants Tom, Dick and Harry as follows:
Tom: 5, 3; Dick: 3, 3; Harry: 1, 3. An anova would produce no
significant differences between the mean ratings of the two
interviewers, yet the interviewers clearly rate the three applicants
differently. A second example: Tom: 5, 3; Dick: 4, 2; Harry: 3, 1. A
correlation would equal unity but would destroy the important mean
differences between interviewers.
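
The two examples can be checked numerically. The sketch below is illustrative only: the helper names are hypothetical, and the second data set takes Dick's first rating to be 4, an assumed value chosen because it is the one that makes the correlation exactly one.

```python
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def mean_abs_distance(x, y):
    """Sum of absolute rank differences divided by the number rated."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Example 1: ratings of Tom, Dick and Harry by two interviewers.
first_a, first_b = [5, 3, 1], [3, 3, 3]
# Example 2: Dick's first rating (4) is an assumption, see above.
second_a, second_b = [5, 4, 3], [3, 2, 1]

# Equal means, so a comparison of means sees no difference...
equal_means = mean(first_a) == mean(first_b)
# ...yet the distance shows the interviewers disagree.
first_distance = mean_abs_distance(first_a, first_b)

# Perfect correlation, yet every applicant is rated two ranks apart.
second_r = pearson(second_a, second_b)
second_distance = mean_abs_distance(second_a, second_b)
```

In the first case the means agree while the distance is 4/3 of a category per applicant; in the second the correlation is one while the distance is a full two categories.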

Although the distance functions do not have a limited range, as
does the correlation coefficient, the retention of raw score units is
an interpretative advantage, rather than a disadvantage.

Cronbach and Gleser (7) and Rulon et al. (8) have discussed the
similarity between D², Mahalanobis D, and the discriminant function.

For a pair of interviewers to have a high inter-distance value,
that is, to disagree on their ratings of the same people, one or both
interviewers could have made errors in judgment. It could be the fate
of a "correct" interviewer to be paired with an "erroneous" interviewer.
Therefore, each member of an interviewer pair having a mean distance
function value, D, of >.71 was regarded as a candidate for
unreliability, since such a D value indicated that these interviewers
would on the average evaluate a candidate more than half a category
apart.

To determine whether interviewers' evaluations were "correct"
(valid) using the first criterion, rank order correlations rather than
distances were calculated between the interviewers' rankings and the
committee rankings, since the two rankings were on different
instruments. (A slight error occurs with this scheme because each
interviewer is a member of the admissions committee. The error is
similar to that in an item-total score correlation.)

Interviewers having rank-order correlations (rho) of <.6 with the
admissions committee final rating were regarded as not valid;
interviewers having rank-order correlations of ≥.6 were regarded as
valid.
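
The validity rule on the first criterion can be sketched as a rank-order correlation followed by the .6 cutoff. The source does not say how tied ranks were handled; the sketch below assumes average ranks for ties and computes rho as the Pearson correlation of the ranks. The function names and the sample data are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Rank-order correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def is_valid(interviewer_ranks, committee_ranks, cutoff=0.6):
    """Apply the rule from the text: rho of at least .6 counts as valid."""
    return spearman_rho(interviewer_ranks, committee_ranks) >= cutoff


# Hypothetical data: one interviewer's ranks and the committee's ratings.
agrees = is_valid([5, 4, 3, 2, 1], [50, 40, 30, 20, 10])
disagrees = is_valid([5, 4, 3, 2, 1], [10, 20, 30, 40, 50])
```

Because the interviewer and committee scales differ, only the ordering matters here, which is why a rank-order statistic suits this comparison better than a distance.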

Interviewers can be thought of as being in one cell of a 2 x 2
table, the columns of which are labeled candidates for unreliability
(yes, no), and the rows of which are labeled satisfactory validity
(yes, no). Decisions about what to recommend for interviewers in each
of the four cells will now be discussed.

Interviewers who were classified as not being a candidate for un- 
reliability and who also had high validity coefficients should be re- 
tained on the admissions committee. 

Interviewers who were classified as not being a candidate for
unreliability but who had low validity coefficients should probably have
their performance reviewed. However, if applicants whom these
interviewers would have turned down were accepted into medical school
and had difficulty, then the admissions committee is erroneously
ignoring the insights of these interviewers.

Interviewers who were classified as being a candidate for
unreliability and who also had satisfactory validity coefficients can
probably be retained on the admissions committee. A useful statistic to
help in making this decision derives from the fact that each
interviewer typically interviewed applicants in common with five or six
other interviewers, and a distance for each pair can be computed. This
statistic is the ratio of the number of times an interviewer was paired
with an interviewer having disparate ratings (symbolized by U for
unlike) to the total number of interviewers (T) he was paired with. If
the U/T ratio was 1, then the interviewer disagreed with every other
interviewer he was paired with. If the U/T ratio was 0, then the
interviewer agreed with every other interviewer he was paired with, and
would not be a candidate for unreliability. It is possible, of course,
for a set of interviewers to agree with each other, yet all be
erroneous ("incorrect" or invalid). If the U/T ratio is >.5, then the
interviewer may not be sufficiently stable in his judgments to warrant
retention on the admissions committee without further training.
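
The U/T ratio can be sketched as a count over pairs. This is a minimal illustration with hypothetical names and data; it assumes that "disparate ratings" means a mean pair distance D above the .71 cutoff used to flag candidates for unreliability.

```python
from collections import Counter


def ut_ratios(pair_distances, threshold=0.71):
    """Return U/T for every interviewer.

    pair_distances maps a pair of interviewer names to that pair's mean
    distance D. U counts the pairs in which D exceeds the threshold;
    T counts all pairs the interviewer took part in.
    """
    unlike = Counter()
    total = Counter()
    for (first, second), distance in pair_distances.items():
        for interviewer in (first, second):
            total[interviewer] += 1
            if distance > threshold:
                unlike[interviewer] += 1
    return {name: unlike[name] / total[name] for name in total}


# Hypothetical mean distances for three interviewer pairs.
ratios = ut_ratios({
    ("A", "B"): 0.9,
    ("A", "C"): 0.3,
    ("B", "C"): 1.2,
})
```

In this made-up case interviewer B disagrees with both partners (U/T of 1), while A and C each disagree with one of two (U/T of .5).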

Finally, interviewers who were both candidates for unreliability 
and also had low validity coefficients probably should be dropped from 
the admissions committee, particularly if their recommendations are 
not substantiated by later student performance in medical school. 
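
The four cells of the 2 x 2 table and the recommendation attached to each can be summarized in a small function. The function name, argument names, and return strings are hypothetical paraphrases of the recommendations in the text, which are themselves hedged ("probably") and are not hard rules.

```python
def recommend(candidate_for_unreliability, valid, ut_ratio=None):
    """Map an interviewer's cell in the 2 x 2 table to a recommendation.

    candidate_for_unreliability: True if any pair distance D exceeded .71.
    valid: True if rho with the committee rating was at least .6.
    ut_ratio: optional U/T ratio, used only in the unreliable-but-valid cell.
    """
    if not candidate_for_unreliability and valid:
        return "retain"
    if not candidate_for_unreliability and not valid:
        return "review performance"
    if candidate_for_unreliability and valid:
        # A U/T ratio above .5 suggests unstable judgments.
        if ut_ratio is not None and ut_ratio > 0.5:
            return "retain only with further training"
        return "probably retain"
    return "probably drop"
```

For example, `recommend(True, True, ut_ratio=0.2)` returns `"probably retain"`, matching the case of a valid interviewer who disagreed with only a few partners.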
The criterion of the decision of the total admissions committee
is valuable because it can be computed using all applicants to medical
school, not just accepted students. As time goes on, though, a second
criterion of performance in medical school becomes available, which
necessarily derives from a smaller group. For each interviewer for
each year, rank order correlations (rho) were computed between the
interviewer's rating and each of four measures: the mean number of
times a student got honors in a course, mean delayed grades, mean
subjective rating as a house officer, and, for year one students, the
NBME Part I total score. Interviewers having all four correlations
positive for accepted students were judged valid on the second
criterion; interviewers having one or more of the four correlations
negative were judged invalid on the second criterion. For those
interviewers whose ratings had been judged as invalid using the first
criterion of admissions committee decision, the progress of the
particular students they rated was examined to see if that
interviewer's initial rating was substantiated by that student's
progress in medical school.

Results 

Rank-order correlations between the least squares, absolute value,
and loss distance formulas were obtained. Correlations between least
squares and absolute value distances were .99 and .91 for years one and
two, between least squares and loss were .80 and .81, and between
absolute value and loss were .75 and .60. Differences in decisions
based on the use of differing distance functions will not be further
discussed here.

For year one, of all candidates for unreliability, 40% had
interviewer-committee correlations <.6 (were not valid), and 60% had
interviewer-committee correlations of ≥.6 (were valid). Of the 60% who
had valid, but possibly unreliable ratings, only one had a U/T ratio of
≥.5; most had U/T ratios of .2 or .1. A U/T ratio of .5 means that
the interviewer disagreed with half of the other interviewers who rated
the same applicants. One interviewer had a validity coefficient of <.6,
but he had not been identified as a candidate for unreliability because
he had fewer than four interviews in common with any other interviewer.

For year two, of all candidates for unreliability, half had
interviewer-committee correlations <.6 (were not valid), and half had
correlations ≥.6. Of this latter half, only 2 had U/T ratios ≥.5. Of the
interviewers who had unsatisfactory validity coefficients but who were
not candidates for unreliability, half had fewer than four interviews in
common with any other interviewer, and thus would not have been
identified as a candidate for unreliability.

The design of a longitudinal study permits the comparison of
earlier performance with later performance. If a judge's performance is
constant across time, then increasingly greater confidence is obtained
that the judge is being correctly classified. Interviewers who remain
reliable and valid across years should be retained on the admissions
committee; those who remain unreliable and not valid across years might
be given another assignment. In the present study, two interviewers
remained in the non-valid category for both year one and year two.

It should be mentioned that certain students were re-interviewed
when the committee could not reach agreement. These re-interviews were
not included in the present analysis.

Data from those interviewers identified as invalid by the first
criterion of admissions committee decision were examined to see if
students whom they rated low did poorly in medical school, and if
students whom they rated high did well. Two examples will illustrate
possible outcomes. Table 1 (a) was produced by Dr. X, 1 (b) by Dr. Y.

Students were classified on whether they got a mean course rating
of 4 or higher on a 7-point scale. A student doing this well is rarely,
if ever, in trouble in that course.

If a student was rejected by interviewer X, the probability was 
.57 that he would do satisfactory work in medical school. If a student 
was accepted by interviewer X, the probability was .67 that he would do 
satisfactory work in medical school. 

Thus, although interviewer X was classified by the first criterion
as an invalid interviewer, he may, on balance, be marginally acceptable.
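
The two probabilities quoted for interviewer X follow directly from the counts listed for Table 1 (a): 12 students rated acceptable, of whom 8 did satisfactory work, and 7 rated unacceptable, of whom 4 did satisfactory work. The short check below is only a worked illustration; the function name is hypothetical.

```python
def satisfactory_rate(n_rated, n_satisfactory):
    """Proportion of rated students who later did satisfactory work."""
    if n_rated <= 0 or not 0 <= n_satisfactory <= n_rated:
        raise ValueError("counts must satisfy 0 <= n_satisfactory <= n_rated")
    return n_satisfactory / n_rated


# Counts for interviewer X as they appear in Table 1 (a).
accepted_rate = round(satisfactory_rate(12, 8), 2)  # rated acceptable
rejected_rate = round(satisfactory_rate(7, 4), 2)   # rated unacceptable
```

The gap between the two rates (.67 against .57) is small, which is why X is described as only marginally acceptable rather than clearly valid.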

If a student was rejected by interviewer Y, the probability was one
that he would do satisfactory work in medical school. If a student was
accepted by interviewer Y, the probability was .33 that he would do
satisfactory work in medical school. Thus, although these data for Y are
based on only 8 students, the data support the original classification
of Y as an invalid interviewer. Similar analyses were done on honors,
delayed or failing grades, and NBME Part I scores, although they are not
reported here.

Summary and conclusions 

In this two-year study, the evaluation rankings given by
interviewing physicians to 1391 medical school applicants were
investigated for similarity of rankings between interviewers
(reliability), for similarity of judgments between interviewers and the
admissions committee (validity), and for similarity of interviewers'
judgments and later student performance.

Most interviewers were both reliable and "correct" or valid as
operationally defined herein. Ratings by interviewers who were
identified as candidates for unreliability were also inspected to see
if these interviewers gave rankings incongruent with the admissions
committee.

Three reasons could account for the rankings of those interviewers
classed as candidates for unreliability who also have low validity
coefficients. The first reason could be any kind of error such as
interviewer, instrumental, recording, or interviewer-interviewee
interaction. Some of this error could be decreased by the review of
interviewing principles. The second could be that these interviewers
correctly perceived some positive or negative trait of the interviewee
that others failed to see or failed to be persuaded of. The third could
be that interviewers are not rating applicants on traits relevant to
medical school performance.

Attrition occurred in computing both reliability and validity
indices because some interviewers rated only a few applicants. This
attrition could be reduced in future studies if interviewers were
required to interview at least 20 applicants.

The method described permits ready identification of those
interviewers who are reliable, and whose judgments are in accord with
validity criteria. It is most important for an institution to be aware
of the reliability and validity of one of the major portions of the
admissions process and to rectify any correctable components once they
have been identified.



Table 1

Data from Interviewers classified as invalid
on the criterion of admissions committee decision

(a)

Interviewer X's     Number of Students     Number of Students Getting
rating              Rated                  ≥4 in Mean Course Ratings

Acceptable          12                     8
Unacceptable         7                     4

Total               19                    12

(b)

Interviewer Y's
rating

Acceptable
Unacceptable

8

14